Execute: Cmd+Shift+Enter. New chunk: Cmd+Option+I. Preview: Cmd+Shift+K. Note: the preview shows a rendered HTML copy of the editor's contents. Consequently, unlike Knit, Preview does not run any R code chunks; instead, it displays the output from the last time each chunk was run in the editor.

Configuration

library(leaps)
library(modelr)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-26. For overview type 'help("mgcv-package")'.
library(ggplot2)
library(DataExplorer)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
## 
##     collapse
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(ggExtra)
library(caret)
## Loading required package: lattice
library(reshape2)
library(RColorBrewer)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(corrplot)
## corrplot 0.84 loaded

System Information

Because this analysis uses a large number of libraries, I have included the session information below.

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS/LAPACK: /anaconda3/lib/R/lib/libRblas.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] corrplot_0.84      plotly_4.8.0       RColorBrewer_1.1-2
##  [4] reshape2_1.4.3     caret_6.0-81       lattice_0.20-38   
##  [7] ggExtra_0.8        GGally_1.4.0       dplyr_0.8.0.1     
## [10] DataExplorer_0.7.0 ggplot2_3.1.0      mgcv_1.8-26       
## [13] nlme_3.1-137       modelr_0.1.3       leaps_3.0         
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.0         tidyr_0.8.2        viridisLite_0.3.0 
##  [4] jsonlite_1.6       splines_3.5.1      foreach_1.4.4     
##  [7] prodlim_2018.04.18 shiny_1.2.0        assertthat_0.2.0  
## [10] stats4_3.5.1       yaml_2.2.0         ipred_0.9-8       
## [13] pillar_1.3.1       backports_1.1.3    glue_1.3.0        
## [16] digest_0.6.18      promises_1.0.1     colorspace_1.4-0  
## [19] recipes_0.1.4      htmltools_0.3.6    httpuv_1.4.5.1    
## [22] Matrix_1.2-15      plyr_1.8.4         timeDate_3043.102 
## [25] pkgconfig_2.0.2    broom_0.5.1        purrr_0.2.5       
## [28] xtable_1.8-3       scales_1.0.0       later_0.8.0       
## [31] gower_0.1.2        lava_1.6.5         tibble_2.0.1      
## [34] generics_0.0.2     withr_2.1.2        nnet_7.3-12       
## [37] lazyeval_0.2.1     survival_2.43-3    magrittr_1.5      
## [40] crayon_1.3.4       mime_0.6           evaluate_0.12     
## [43] MASS_7.3-51.1      class_7.3-15       tools_3.5.1       
## [46] data.table_1.12.0  stringr_1.4.0      munsell_0.5.0     
## [49] networkD3_0.4      compiler_3.5.1     rlang_0.3.1       
## [52] grid_3.5.1         iterators_1.0.10   htmlwidgets_1.3   
## [55] igraph_1.2.4       miniUI_0.1.1.1     rmarkdown_1.11    
## [58] gtable_0.2.0       ModelMetrics_1.1.0 codetools_0.2-16  
## [61] reshape_0.8.8      R6_2.4.0           gridExtra_2.3     
## [64] lubridate_1.7.4    knitr_1.21         stringi_1.3.1     
## [67] parallel_3.5.1     Rcpp_1.0.0         rpart_4.1-13      
## [70] tidyselect_0.2.5   xfun_0.4
sapply(c('repr', 'IRdisplay', 'IRkernel'), function(p) paste(packageVersion(p)))
##      repr IRdisplay  IRkernel 
##  "0.19.2"   "0.7.0"  "0.8.15"

Data

First we inspect the raw records using bash:

head -n 5 data/Carseats_org.csv
## "","Sales","CompPrice","Income","Advertising","Population","Price","ShelveLoc","Age","Education","Urban","US"
## "1",9.5,138,73,11,276,120,"Bad",42,17,"Yes","Yes"
## "2",11.22,111,48,16,260,83,"Good",65,10,"Yes","Yes"
## "3",10.06,113,35,10,269,80,"Medium",59,12,"Yes","Yes"
## "4",7.4,117,100,4,466,97,"Medium",55,14,"Yes","Yes"

Now I load the data into R and drop the unnamed ID column (read in as “X”).

carseats <- read.csv("data/Carseats_org.csv", header = TRUE, stringsAsFactors = TRUE)

drops <- c("X")
carseats <- carseats[ , !(names(carseats) %in% drops)]
head(carseats, 10)

Here I split the data into two new data frames, one holding the numeric columns and one holding the categorical columns.

# get vectors of continuous and categorical cols
nums <- dplyr::select_if(carseats, is.numeric)
cats <- dplyr::select_if(carseats, is.factor)

nums[sample(nrow(nums), 10), ]
cats[sample(nrow(cats), 10), ]
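If dplyr were unavailable, the same split can be done in base R by indexing columns with a type predicate; a minimal sketch on a toy data frame standing in for carseats:

```r
# Base-R equivalent of dplyr::select_if: keep columns matching a predicate
df <- data.frame(a = 1:3, b = c(1.5, 2.5, 3.5),
                 f = factor(c("x", "y", "x")))
nums_base <- df[, sapply(df, is.numeric), drop = FALSE]
cats_base <- df[, sapply(df, is.factor),  drop = FALSE]
names(nums_base)  # "a" "b"
names(cats_base)  # "f"
```

Note that `drop = FALSE` keeps the result a data frame even when only one column matches.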

Let’s get quick summaries of each:

summary(nums)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price            Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8   Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0   3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0   Max.   :80.00   Max.   :18.0
summary(cats)
##   ShelveLoc   Urban       US     
##  Bad   : 96   No :118   No :142  
##  Good  : 85   Yes:282   Yes:258  
##  Medium:219
str(nums)
## 'data.frame':    400 obs. of  8 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : int  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : int  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: int  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : int  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : int  120 83 80 97 128 72 108 120 124 124 ...
##  $ Age        : int  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : int  17 10 12 14 13 16 15 10 10 17 ...
str(cats)
## 'data.frame':    400 obs. of  3 variables:
##  $ ShelveLoc: Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Urban    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US       : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Data Dimensionality

This command inspects the data type of each column.

str(carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : int  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : int  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: int  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : int  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : int  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : int  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : int  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
print("")
## [1] ""
print(paste('Number of Columns:', ncol(carseats)))
## [1] "Number of Columns: 11"
print(paste('Number of Numeric Columns:', ncol(nums)))
## [1] "Number of Numeric Columns: 8"
print(paste('Number of Categorical Columns:', ncol(cats)))
## [1] "Number of Categorical Columns: 3"

Here’s a quick way to examine general properties of the data:

DataExplorer::introduce(data=carseats)
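introduce() returns a one-row data frame of basic counts. For readers without DataExplorer, a rough base-R stand-in (quick_intro is a hypothetical helper, not part of any package):

```r
# Rough base-R stand-in for DataExplorer::introduce()
quick_intro <- function(df) data.frame(
  rows               = nrow(df),
  columns            = ncol(df),
  discrete_columns   = sum(sapply(df, function(x) is.factor(x) || is.character(x))),
  continuous_columns = sum(sapply(df, is.numeric)),
  total_missing      = sum(is.na(df)))
quick_intro(iris)  # 150 rows, 5 columns, 1 discrete, 4 continuous, 0 missing
```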

Numeric Plotting

I start out with a few general scatter plots.

plot_ly(data=carseats, 
        x=~Age, 
        y=~Sales, 
        mode = 'markers', 
        type = 'scatter', 
        color=~ShelveLoc) %>%
            layout(title = "Age, Shelf Location, and Sales Scatter Plot", width=900) 
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

The plot below shows good separation by shelf location and a weak linear trend; these variables are worth investigating.

plot_ly(data=carseats, 
        x=~Price, 
        y=~Sales, 
        mode = 'markers', 
        type = 'scatter', 
        color=~ShelveLoc) %>%
            layout(title = "Price, Shelf Location, and Sales Scatter Plot", width=900)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

Here we inspect the density of the ‘Price vs Sales’ relationship:

plot_ly(data=carseats, 
        x=~Price, 
        y=~Sales, 
        mode = 'markers', 
        size = ~Price,
        type = 'scatter',
        colors = "Dark2",
        alpha = .6) %>%
            layout(title = "Price, US, and Sales Scatter Plot", width=900) 
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
## Warning: `line.width` does not currently support multiple values.

Normalization

I choose to normalize the numeric data in order to plot every variable on the same scale. This will allow me to investigate the variation of each predictor relative to Sales.

preObj <- preProcess(nums, method=c("center", "scale"))
scaled.nums <- predict(preObj, nums)

head(scaled.nums)
str(scaled.nums)
## 'data.frame':    400 obs. of  8 variables:
##  $ Sales      : num  0.7095 1.3185 0.9078 -0.0341 -1.1849 ...
##  $ CompPrice  : num  0.849 -0.911 -0.781 -0.52 1.045 ...
##  $ Income     : num  0.155 -0.738 -1.203 1.12 -0.166 ...
##  $ Advertising: num  0.656 1.408 0.506 -0.396 -0.547 ...
##  $ Population : num  0.0757 -0.0328 0.0282 1.3649 0.51 ...
##  $ Price      : num  0.178 -1.385 -1.512 -0.794 0.515 ...
##  $ Age        : num  -0.699 0.721 0.35 0.104 -0.946 ...
##  $ Education  : num  1.183 -1.4882 -0.725 0.0382 -0.3434 ...
print("")
## [1] ""
summary(scaled.nums)
##      Sales            CompPrice            Income        
##  Min.   :-2.65440   Min.   :-3.12856   Min.   :-1.70290  
##  1st Qu.:-0.74584   1st Qu.:-0.65049   1st Qu.:-0.92573  
##  Median :-0.00224   Median : 0.00163   Median : 0.01224  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.64575   3rd Qu.: 0.65375   3rd Qu.: 0.79834  
##  Max.   : 3.10670   Max.   : 3.26225   Max.   : 1.83458  
##   Advertising        Population           Price         
##  Min.   :-0.9977   Min.   :-1.72918   Min.   :-3.87702  
##  1st Qu.:-0.9977   1st Qu.:-0.85387   1st Qu.:-0.66711  
##  Median :-0.2459   Median : 0.04858   Median : 0.05089  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.8067   3rd Qu.: 0.90693   3rd Qu.: 0.64219  
##  Max.   : 3.3630   Max.   : 1.65671   Max.   : 3.17633  
##       Age             Education       
##  Min.   :-1.74827   Min.   :-1.48825  
##  1st Qu.:-0.83779   1st Qu.:-0.72504  
##  Median : 0.07268   Median : 0.03816  
##  Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.78255   3rd Qu.: 0.80137  
##  Max.   : 1.64673   Max.   : 1.56457
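caret's preProcess with center and scale subtracts each column's mean and divides by its standard deviation, the same transformation as base R's scale(); a quick check on a toy data frame (not the Carseats data):

```r
# center/scale by hand: subtract the mean, divide by the sd of each column
df <- data.frame(a = c(1, 2, 3, 4), b = c(10, 30, 50, 70))
scaled <- as.data.frame(scale(df))
round(colMeans(scaled), 10)      # both 0
round(apply(scaled, 2, sd), 10)  # both 1
```

This is why every column in the summary above has mean exactly 0.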

Distributions

Here are scaled distributions (histograms and density plots) for each numeric variable, including Sales. The variables relating to money ($) tend to be approximately normal. Many other variables tend to be approximately uniform, which does not bode well for their predictive power.

scaled.nums %>%
    tidyr::gather() %>% 
        ggplot(aes(x=value, y=..density..)) +

            ggtitle('Distributions of Continuous Variables (scaled)') +

            facet_wrap(~ key, scales = "free") +
            geom_histogram(fill=I("orange"), col=I("black"), bins = 50) +
            geom_density(color="blue", fill='light blue', alpha = 0.4)

Here we show the distribution of each numeric variable as a box plot with jittered points. This is just another way to examine the information shown above.

scaled.nums %>%
    tidyr::gather() %>% 
        plot_ly(x=~key, y=~value,
                type = "box", 
                boxpoints = "all", 
                jitter = 0.4,
                pointpos = 0,
                color = ~key,
                colors = "Dark2") %>%
                      subplot(shareX = TRUE)  %>%
                            layout(title = "Numeric Variable Distributions (scaled)", 
                                  yaxis=list(title='Standard Deviation'),
                                  xaxis=list(title='Variable'),
                                  autosize=FALSE,
                                  width=900,
                                  height=500)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

Scatterplots

Here we plot each numeric variable against Sales (scaled). This allows us to investigate possible linear relationships between each variable and Sales. As shown below, only ‘Price’ appears to have a linear relationship worth investigating.

scaled.nums %>%
    tidyr::gather(-Sales, key = "var", value = "value") %>%
    split(.$var) %>%
       lapply(function(d) plot_ly(d, x=~value, y=~Sales, 
                mode = 'markers', 
                type = 'scatter',
                colors = 'Set3',
                size = ~Sales^2,
                marker = list(line = list(
                color = 'black',
                width = 2)),
                name=~var)) %>%
                      subplot(nrows=7, shareY=TRUE)  %>%
                            layout(title = "Scatterplot Numeric vs Sales (scaled)<br>Size=Sales^2",
                                   autosize=FALSE,
                                   width=900,
                                   height=3000,
                                   yaxis=list(title='Sales'))
## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()


By adding naive regression lines to a few scatterplots we can confirm our suspicions:

fit.Pop <- lm(Sales ~ Population, data = scaled.nums)
fit.Age <- lm(Sales ~ Age, data = scaled.nums)
fit.CompPrice <- lm(Sales ~ CompPrice, data = scaled.nums)
fit.Price <- lm(Sales ~ Price, data = scaled.nums)

p1 <-  plot_ly(scaled.nums, 
               x = ~Population,
               name = 'Population vs Sales Regression Line') %>% 
                   add_markers(y = ~Sales,
                               name = 'Population vs Sales Observations') %>% 
                                    add_lines(x = ~Population, 
                                              y = fitted(fit.Pop)) 
p2 <-  plot_ly(scaled.nums, 
               x = ~Age,
               name = 'Age vs Sales Regression Line') %>% 
                   add_markers(y = ~Sales,
                               name = 'Age vs Sales Observations') %>% 
                                    add_lines(x = ~Age, 
                                              y = fitted(fit.Age)) 

p3 <-  plot_ly(scaled.nums, 
               x = ~CompPrice,
               name = 'CompPrice vs Sales Regression Line') %>% 
                   add_markers(y = ~Sales,
                               name = 'CompPrice vs Sales Observations') %>% 
                                    add_lines(x = ~CompPrice, 
                                              y = fitted(fit.CompPrice)) 

p4 <-  plot_ly(scaled.nums, 
               x = ~Price,
               name = 'Price vs Sales Regression Line') %>% 
                   add_markers(y = ~Sales,
                               name = 'Price vs Sales Observations') %>% 
                                    add_lines(x = ~Price, 
                                              y = fitted(fit.Price)) 

subplot(p1, p2, p3, p4, nrows=2, shareY=TRUE) %>%
    layout(title = "Scatterplot Numeric vs Sales (scaled) <br> With Regression Lines",
           autosize=FALSE,
           width=900,
           height=500,
           yaxis=list(title='Sales'))
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

Let’s compare the slopes:

scaled.nums %>%
    plot_ly(y = ~Sales) %>%
      add_lines(x= ~Population, y = fitted(fit.Pop), 
                name = "fit.Pop slope", line = list(shape = "linear")) %>%
      add_lines(x= ~Age, y = fitted(fit.Age), 
                name = "fit.Age slope", line = list(shape = "linear")) %>%
      add_lines(x= ~CompPrice, y = fitted(fit.CompPrice), 
                name = "fit.CompPrice slope", line = list(shape = "linear")) %>%
      add_lines(x= ~Price, y = fitted(fit.Price), 
                name = "fit.Price slope", line = list(shape = "linear")) %>%
            
    layout(title = "Regression Lines vs Sales (scaled)",
           autosize=FALSE,
           width=700,
           height=500,
           yaxis=list(title='Sales'),
           xaxis=list(title='Scaled Numeric Variable'))
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
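A side note on why these standardized slopes are easy to interpret: when both x and y have unit variance, the simple-regression slope equals their correlation. A sketch of the identity on the built-in mtcars data (not the Carseats data):

```r
# On standardized data, the slope of lm(y ~ x) equals cor(x, y)
d <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp", "qsec")]))
slopes <- sapply(c("wt", "hp", "qsec"),
                 function(v) unname(coef(lm(reformulate(v, "mpg"), data = d))[2]))
correlations <- as.vector(cor(d[, c("wt", "hp", "qsec")], d$mpg))
round(slopes - correlations, 10)  # all 0
```

So the steepest line in the slope-comparison plot belongs to the predictor most strongly correlated with Sales.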

Here’s a pretty graphic that doesn’t help me understand anything about the data.

y = scaled.nums$Sales
x = scaled.nums$Price

s <- subplot(
  plot_ly(x = x, color = I("black"), type = 'histogram'), 
  plotly_empty(), 
  plot_ly(x = x, y = y, type = 'histogram2dcontour', showscale = F), 
  plot_ly(y = y, color = I("black"), type = 'histogram'),
  nrows = 2, heights = c(0.2, 0.8), widths = c(0.8, 0.2),
  shareX = TRUE, shareY = TRUE, titleX = FALSE, titleY = FALSE)
## Warning: No trace type specified and no positional attributes specified
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
layout(s, showlegend = FALSE, autosize=FALSE,
           width=700,
           height=500, 
           yaxis=list(title='Sales'),
           xaxis=list(title='Price'))
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

Categorical Plotting

First, let’s create a data frame that combines scaled Sales with the categorical variables:

categorical.by.sales = cbind(Sales = scaled.nums$Sales, cats)

Let’s try ’em all at once. This works, but I don’t manage the colors very well.

colr = brewer.pal(9, "Set1")

categorical.by.sales %>%
    tidyr::gather(-Sales, key = "var", value = "value") %>%
    split(.$var) %>%
       lapply(function(d) plot_ly(d, x=~paste(var, '<br>' , value), y=~Sales,
                type = "box", 
                boxpoints = "all", 
                jitter = .2,
                pointpos = 0,
                color =~paste(var, ":", value),
                colors = sample(colr, length(unique(d$value))))) %>%
#                 colors = "Set1")) %>%
                      subplot(shareY = TRUE, nrows=3)  %>%
                            layout(title = "Categorical Variable Distributions vs Sales (scaled)", 
                                  yaxis=list(title='Sales Standard Deviation'),
                                  xaxis=list(title=''),
                                  autosize=FALSE,
                                  width=900,
                                  height=1500)
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

I run each of these separately and get much nicer plots.

categorical.by.sales %>%               
    plot_ly(x = ~US, y = ~Sales,
        split = ~US,
        type = 'violin',
        box = list(visible = TRUE),
        meanline = list(visible = TRUE)) %>% 
              layout(xaxis = list(title = "US"),
                     yaxis = list(title = "Sales",zeroline = FALSE))
categorical.by.sales %>%               
    plot_ly(x = ~Urban, y = ~Sales,
        split = ~Urban,
        type = 'violin',
        box = list(visible = TRUE),
        meanline = list(visible = TRUE)) %>% 
              layout(xaxis = list(title = "Urban"),
                     yaxis = list(title = "Sales",zeroline = FALSE))

Now this is interesting. We suspected that ‘ShelveLoc’ would be important based on one of the early scatterplots. It seems that this is the case.

categorical.by.sales %>%               
    plot_ly(x = ~ShelveLoc, y = ~Sales,
        split = ~ShelveLoc,
        type = 'violin',
        box = list(visible = TRUE),
        meanline = list(visible = TRUE)) %>% 
              layout(xaxis = list(title = "ShelveLoc"),
                     yaxis = list(title = "Sales",zeroline = FALSE))
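The violin plots can be reduced to numbers with tapply; a sketch on the built-in ToothGrowth data, standing in for Sales grouped by ShelveLoc:

```r
# Group means and medians of a numeric response by a factor
tapply(ToothGrowth$len, ToothGrowth$supp, mean)
tapply(ToothGrowth$len, ToothGrowth$supp, median)
```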

Linear Regression

First, let’s merge the data set into a single data frame. (Note that merging on Sales, a continuous variable with repeated values, duplicates some rows: the regression below reports 538 residual degrees of freedom, implying 550 rows rather than the original 400.)

scaled.merged <- merge(scaled.nums, categorical.by.sales, by="Sales")

head(scaled.merged)
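A caution worth noting: merge() on a column with repeated values joins every matching pair of rows, so the merged data frame can end up with more rows than the original. A minimal illustration:

```r
# The key value 1 appears twice on both sides, giving 2 * 2 = 4 matches
a <- data.frame(key = c(1, 1, 2), x = c("a", "b", "c"))
b <- data.frame(key = c(1, 1, 2), y = c("p", "q", "r"))
nrow(merge(a, b, by = "key"))  # 5 rows, not 3
```

Because Sales is continuous and has ties, cbind (as used for categorical.by.sales above) is a safer way to recombine columns that are already row-aligned.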

Before modeling, let’s look at some things that may give us trouble. Fortunately, it looks like the only notable correlations involve our dependent variable, Sales.

res <- cor(scaled.nums)

corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)
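The same screen can be done programmatically by flagging pairs whose absolute correlation exceeds a threshold; a sketch on mtcars, with 0.8 as an arbitrary cutoff:

```r
# List variable pairs whose absolute correlation exceeds a threshold
res <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
idx <- which(abs(res) > 0.8 & upper.tri(res), arr.ind = TRUE)
data.frame(var1 = rownames(res)[idx[, 1]],
           var2 = colnames(res)[idx[, 2]],
           r    = round(res[idx], 2))
```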

The residuals are roughly symmetric around 0, which is encouraging, but the fit still leaves a fair amount of variance unexplained. Note how close to zero several of the coefficient estimates are; Population, Education, Urban, and US are not significant.

simple.lm <- lm(Sales~., data=scaled.merged)
summary(simple.lm)
## 
## Call:
## lm(formula = Sales ~ ., data = scaled.merged)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32467 -0.29881 -0.01677  0.27320  1.82851 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.70000    0.06018 -11.633  < 2e-16 ***
## CompPrice        0.44976    0.02515  17.883  < 2e-16 ***
## Income           0.11240    0.02029   5.539 4.78e-08 ***
## Advertising      0.24069    0.02389  10.076  < 2e-16 ***
## Population       0.01656    0.02134   0.776    0.438    
## Price           -0.70235    0.02530 -27.759  < 2e-16 ***
## Age             -0.23287    0.02012 -11.576  < 2e-16 ***
## Education       -0.02619    0.02029  -1.291    0.197    
## ShelveLocGood    1.39901    0.06043  23.152  < 2e-16 ***
## ShelveLocMedium  0.59700    0.04974  12.003  < 2e-16 ***
## UrbanYes         0.03785    0.04353   0.870    0.385    
## USYes            0.06868    0.04773   1.439    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4672 on 538 degrees of freedom
## Multiple R-squared:  0.7543, Adjusted R-squared:  0.7492 
## F-statistic: 150.1 on 11 and 538 DF,  p-value: < 2.2e-16
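Significant terms can also be extracted programmatically rather than read off the printed table; a sketch on a small mtcars model:

```r
# Filter the coefficient table of a fit by p-value
fit <- lm(mpg ~ wt + hp, data = mtcars)
ct  <- summary(fit)$coefficients
ct[ct[, "Pr(>|t|)"] < 0.05, , drop = FALSE]  # rows for significant terms
```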

Interaction Terms

Here we define a new model with three interaction terms: Income:Advertising, Income:CompPrice, and Price:Age.

interaction.lm <- lm(Sales~. + Income*Advertising + Income*CompPrice + Price*Age, data=scaled.merged)
summary(interaction.lm)
## 
## Call:
## lm(formula = Sales ~ . + Income * Advertising + Income * CompPrice + 
##     Price * Age, data = scaled.merged)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31721 -0.30723 -0.01874  0.28583  1.91559 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.69732    0.05993 -11.635  < 2e-16 ***
## CompPrice           0.44981    0.02500  17.991  < 2e-16 ***
## Income              0.10889    0.02028   5.369 1.18e-07 ***
## Advertising         0.23790    0.02377  10.008  < 2e-16 ***
## Population          0.01367    0.02121   0.645   0.5194    
## Price              -0.70114    0.02513 -27.903  < 2e-16 ***
## Age                -0.23142    0.02003 -11.553  < 2e-16 ***
## Education          -0.03038    0.02024  -1.501   0.1339    
## ShelveLocGood       1.39095    0.06040  23.028  < 2e-16 ***
## ShelveLocMedium     0.58444    0.04973  11.752  < 2e-16 ***
## UrbanYes            0.03659    0.04323   0.846   0.3977    
## USYes               0.07076    0.04740   1.493   0.1360    
## Income:Advertising  0.03748    0.01999   1.875   0.0613 .  
## CompPrice:Income   -0.05181    0.02053  -2.524   0.0119 *  
## Price:Age           0.01399    0.01991   0.703   0.4826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4637 on 535 degrees of freedom
## Multiple R-squared:  0.7593, Adjusted R-squared:  0.753 
## F-statistic: 120.5 on 14 and 535 DF,  p-value: < 2.2e-16
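Under the hood, R expands a term like Income*Advertising into the two main effects plus an elementwise-product column; model.matrix makes this explicit (toy columns shown, not the real data):

```r
# The Income:Advertising column is the product of the two inputs
d <- data.frame(Income = c(1, 2, 3), Advertising = c(10, 20, 30))
model.matrix(~ Income * Advertising, d)
# The interaction column equals c(10, 40, 90)
```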

Variable Significance

  1. Which variables (and interaction pairs) are significant? Do you find anything surprising or counterintuitive in the results? Describe any response, or lack of response, that seems surprising.

  2. Repeat the above, but use the “leaps” package to perform subset selection on the full model using ONLY the following variables:

Price, Population, CompPrice, Income, Education, Age, Advertising, ShelveLoc

Plus the following interaction terms: Income and Age, Income and Advertising, Income and CompPrice.

Use the regsubsets command in “leaps” to fit the full model (do NOT specify really.big=T, as it will take a long time to complete), along with forward and backward subset selection. Use the R file Subset_select.R as a guide for syntax.

  3. Use the following model selection criteria to determine which of the resulting models is “best”: RSS, adjusted R squared, Mallow’s Cp, and BIC.

The result of each regression is a model object, so you can query the appropriate component of that object to determine which of the fitted models produces the best value of each metric.

Create a table, or simply a list, showing which model number is best for the full model and for forward and backward selection; then write out the best model under the BIC measure, that is, put it in the form y = intercept + coefficient*variable. NOTE: these may be the same model, or may differ slightly.
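As a syntax guide (on the built-in mtcars data, not the Carseats assignment itself), regsubsets can be run with all three methods and the best model per criterion read off the summary object:

```r
library(leaps)

# Exhaustive, forward, and backward subset selection (mtcars stands in
# for the Carseats formula described above)
fits <- lapply(c("exhaustive", "forward", "backward"), function(m)
  regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = m))

s <- summary(fits[[1]])
best <- c(adjr2 = which.max(s$adjr2),  # adjusted R^2: larger is better
          cp    = which.min(s$cp),     # Mallow's Cp: smaller is better
          bic   = which.min(s$bic))    # BIC: smaller is better
best
coef(fits[[1]], best[["bic"]])  # coefficients of the BIC-best model
```

Note that RSS decreases monotonically as variables are added, so it always favors the largest model; adjusted R squared, Cp, and BIC penalize complexity and can pick smaller ones.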